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GAMMA-BASED CLUSTERING VIA ORDERED MEANS 
WITH APPLICATION TO GENE-EXPRESSION ANALYSIS 

By Michael A. Newton and Lisa M. Chung 
University of Wisconsin at Madison 

Discrete mixture models provide a well-known basis for effec- 
tive clustering algorithms, although technical challenges have lim- 
ited their scope. In the context of gene-expression data analysis, a 
model is presented that mixes over a finite catalog of structures, each 
one representing equality and inequality constraints among latent 
D expected values. Computations depend on the probability that in- 

dependent gamma-distributed variables attain each of their possible 
orderings. Each ordering event is equivalent to an event in indepen- 
dent negative-binomial random variables, and this finding guides a 
dynamic-programming calculation. The structuring of mixture-model 
components according to constraints among latent means leads to 
strict concavity of the mixture log likelihood. In addition to its ben- 
eficial numerical properties, the clustering method shows promising 
results in an empirical study. 
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1. Introduction. A common problem in statistical genomics is how to 
organize expression data from genes that have been determined to exhibit 
differential expression relative to various cellular states. Cells in a time- 
ly course experiment may exhibit such genes, as may cells in any sort of de- 
qq signed experiment or observational study where expression alterations are 
CO being examined (e.g., Parmigiani et at, 2003; Speed, 2004). In the event that 
the error-rate-controlled list of significantly altered genes is small, the post- 
processing problem amounts to inspecting observed patterns of expression, 
investigating what is known about the relatively few genes identified, and 
planning follow-up experiments as necessary. However, it is all too common 
that hundreds or even thousands of genes are detected as significantly altered 
in their expression pattern relative to the cellular states. Post-processing 
these non-null genes presents a substantial statistical problem. Difficulties 
are compounded in the multi-group setting because a gene can be non-null 
in many different ways (Jensen et al. 2009). 

Ever since Eisen et al. (1998), clustering methods have been used to or- 
ganize expression data. Information about a gene's biological function may 
be conveyed by the other genes sharing its pattern of expression. Thala- 
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muthu et al. (2006) provides a recent perspective. Clustering methods are 
often applied in order to partition non-null genes which have been identi- 
fied in differential expression analysis (e.g., Campbell et al. 2006; Grasso 
et al. 2008). Popular approaches are informative but not completely sat- 
isfactory. There are idiosyncratic problems, like how to select the number 
of clusters, but there is also the subtle issue that the clusters identified by 
most algorithms are anonymous: each cluster is defined only by similarity 
of its contents rather than by some external pattern that its genes may be 
approximating. Anonymity may contribute to technical problems, such as 
that the objective function being minimized is not convex, and that real- 
ized clusters have a more narrow size distribution than is warranted by the 
biological system. 

Model-based clustering treats data as arising from a mixture of compo- 
nent distributions, and then forms clusters by assigning each data point to 
its most probable component (e.g., Titterington et al. 1985; McLachlan and 
Basford, 1988). For example, the mclust procedure is based on mixtures 
of Gaussian components (Fraley and Raftery, 2002); the popular K-means 
algorithm is implicitly so based (Hastie et al, 2001, page 463). There is con- 
siderable flexibility in model-based clustering, though technical challenges 
have also affected its development: the likelihood function is often multi- 
modal; identifiability can be difficult to establish (e.g., Redner and Walker, 
1984; Holzmann et al, 2006); and even where constraints may create identi- 
fiability, there can be a problem of label-switching during Bayesian inference 
(Stephens, 2000). Some sophisticated model-based clustering methods have 
been developed for gene expression (e.g., Medvedovic et al, 2004). Beyond 
empirical studies it is difficult to determine properties of such approaches, 
and their reliance on Monte Carlo computation is somewhat limiting. 

Here a model-based clustering method is developed that aims to support 
multi-group gene-expression analysis and possibly other applications. The 
method, called gamma ranking, places genes in a cluster if their expression 
patterns commonly approximate one element from a finite catalog of possi- 
ble structures, in contrast to anonymous methods (Section 2). Under certain 
conditions the component distributions are linearly independent functions 
- each one associated with a structure in the finite catalog - and this con- 
fers favorable computational characteristics to the gamma-ranking proce- 
dure (Sections 4,5). The cataloged structures record patterns of equality 
and inequality among latent expected values. Where normal-theory speci- 
fications seem to be intractable, a gamma-based mixture model produces 
closed formulas for all necessary component densities, thanks to an embed- 
ding of the relevant gamma-distributed variables in a set of Poisson processes 
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(Section 3). The formulation also extends to Poisson-distributed responses 
that are characteristic of gene expression measured by next-generation se- 
quencing (Section 6). 

2. Mixture of structured components. The data considered has 
a relatively simple layout. Each gene g from a possibly large number is 
associated with a vector x g = (x g ,i, x 9: 2, ■ ■ ■ , x g , n ) holding measurements 
of gene expression from n distinct biological samples. The n samples are 
distributed among 1 < p < n different groups, which represent possibly 
different transcriptional states of the cells under study. The groups may 
represent cells exposed to p different chemical treatments, cells at p different 
developmental stages, or cells at p different points along a time course, for 
example. The layout of samples {1,2,..., n} is recorded in a vector, say I = 
(li, I2, ■ ■ ■ , In)-, with li = j indicating that sample i comes from group j. This 
is fixed by design and known to the analyst; to simplify the development we 
suppress I from the notation below except where clarification is warranted. 

Each expression measurement x g ^ is treated as a positive, continuous 
variable representing a fluorescence intensity from a microarray, after pre- 
processing has adjusted for various systematic effects not related to the 
groupings of interest. Recent technological advances allow expression to be 
measured instead as an explicit abundance count. The mixture model de- 
veloped below adapts readily to this case (Section 6). 

Gamma ranking entails clustering genes according to the fit of a specific 
model of gene-level data. The joint probability density for a data vector 
x g , denoted p{x g ), is treated as a finite mixture over a catalog of discrete 
structures rj, each of which determines ordering constraints among latent 
expected values. More specifically, 



where ir v is a mixing proportion and the component density p(x g \rj) is de- 
termined through modeling. 

Each rj is a partition of group labels {1, 2, . . . ,p}, containing K„ subsets, 
that also carries an ordering of these subsets. For example, three structures 
cover the two-group comparison, denoted {(1)(2), (12), (2)(1)}. The nota- 
tion conveys both the partition of group means and the ordering of subsets 
within the partition. For instance, in rj = (2)(1) the expected expression 
level in group 2 is less than that of group 1; while rj = (12) indicates that 
both groups share a common latent mean. With p = 3 groups there are 13 
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structures, 

(123), (12)(3), (3)(12), (13)(2), (2)(13), (1)(23), (23)(1), 
(1)(2)(3), (2)(1)(3), (1)(3)(2), (2)(3)(1), (3)(1)(2), (3)(2)(1), 

and the number grows rapidly with the number of groups (Table [I}. A 
way to think about i? rd = {?7}> the catalog of these ordered structures on 
p groups, is to imagine p real values y = (2/1,2/2, •• • , Vp) and the possible 
vectors you would get by ranking y. Of course there are pi rankings if ties 
are not permitted, but generally there are far more rankings, and -ff rd is hi 
1-1 correspondence with the set of rankings of p numbers, allowing ties. 

Table 1 

The number of ordered structures, Bell+, as a function of the number of groups, p. This 
is ~^21, =1 (k\)S(p, k), where S(p,k) are Stirling numbers of the second kind. The Bell 
number of partitions of I, ■ ■■ ,p is included for comparison. 
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Bell+ 


Bell 


2 


3 


2 
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13 


5 
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75 


15 


5 


541 


52 


6 


4683 


203 



An ordered structure rj also dictates an association between sample labels 
i 6 {1, 2, ... j 7i } and levels of the latent expected values. The null structure 
rj = (12 . . .p), for example, entails equal mean expression across slip groups; 
all observations are associated with a single mean value (and we write K n = 
1). More generally there are K v > 1 distinct mean values, [i\ < fj>2 < ■ ■ ■ < 
Hk v , say. Without loss of generality we index the means by rank order. The 
association maps each i S {1,2, ... ,n} to some /i^; it amounts to a partition 
of the samples together with an ordering of the subsets within the partition 
matching the order of the latent means. We express this association with 
disjoint subsets a(rj, k), k = 1, 2, . . . , K v , and have k follow the order of the 
expected values. For example, suppose that samples {1, 2, . . . , 6} constitute 
two replicate samples in each of p = 3 groups, and rj = (13) (2) is considered 
to relate the group-specific expected values. (I.e., the gene is upregulated 
in group 2, and not differentially expressed between groups 1 and 3.) Then 
K n = 2, a(rj, 1) = {1, 2, 5, 6} and a(rj, 2) = {3, 4}. Subset a(rj, k) includes nk 
samples and induces gene-level statistics such as 

s g,k = x g,i and t g> k = x g ^. 

i&a{r),k) i£cr(n,k) 

The structure/partition notation is convenient in multi- group mixture 
modeling. For clarification, let us refer back to the layout notation and take 
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the replicates rj = {i : U = j}, which equal those samples in group j. 
Consider a gene that is completely differentially expressed relative to the p 
groups; that is, it assumes one of the p\ structures 77 in which K„ = p. It 
follows that each set rj equals exactly one of the subsets a(rj,k). (It would 
be 17(77, 1) if rj had the lowest mean expression level, for instance.) In the 
absence of complete differential expression, multiple groups share expected 
values. Generally, therefore, each subset 17(77, k) is a union of various replicate 
sets rj. The language also conveys the assumption that replicates i\ and 12 in 
the same set rj necessarily share expected value, regardless of the structure 
77. In calculating probabilities, the sets 17(77, k) of equi-mean samples are more 
important than the replicate sets rj. 

From the mixture model ([!]), posterior structure probabilities are p{r\ \x g ) = 
p(xg\r])TT v /p(x g ) and these determine gene clusters by Bayes's rule assign- 
ment. Alternatively the cluster contents can be regulated by a threshold 
parameter c, and 

(2) cluster^) = {g : p(r)\x g ) > c} , 

though some genes may go unassigned in this formulation. In any case each 
cluster holds genes with empirical characteristics matching some discrete 
mean-ordering structure. 

The latent expected values are constrained by 77 to the order \i\ < fi2 ■ ■ ■ < 
fj-K v - Propeling our calculations is the ability to integrate these ordered 
means (i.e., marginalize them) in a model involving gamma distributions 
on some transformation of the yu^'s. Recall that a gamma distribution with 
shape a > and rate A > 0, denoted Gamma(a, A), has probability density: 

. . A a z a - 1 exp{-zA} 
We assume that inverse means ^ = 1/nk have joint density 



(3) Pt,(iPi,---,iI>k v ) = K v l 



K 



r(a„) 



fe=l 

xl [t/ji > i/i 2 ... > tpK v ] 

which reflects independent and identically distributed Gamma(ao, olqVq) com- 
ponents, conditioned to one ordering. Using this parameterization gives vq 
an interpretation as a centering parameter; on the null structure having a 
single latent mean /ji, 1/uq = E(\/(jl\). 
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To complete the hierarchical specification we assume a gamma observation 
model 



(4) p(Xg\l/>l,...,l/>K v ,V) =H ll 



{aip k ) a Xg A 1 exp{-x g>i ip k a} 



r(a) 

fc=l i&a(ri,k) y ' 

11 (r(«))" fc ' 

Equivalently, with sample i S a(r],k), measurement is distributed as 
Gamma(a,a4)j ah conditionally on the latent values and rj, and indepen- 
dently across samples. The gamma observation component is often sup- 
ported empirically; there is theoretical support from stochastic models of 
population abundance (Dennis and Patil, 1984; Rempala and Pawlikowska, 
2008); and there are practical considerations that a gamma-based model 
may be the only one for continuous data in which ordering calculations are 
feasible. 

The structured component p(x g \rj) in Q arises by integrating Q against 
the continuous mixing distribution (|3]). Specifically, 



P{ x g\v) = I P{Xg\lpl, ■ ■ ■ ,^K V ,V) Pv&l, ■ ■ ■ ,^K V ) (Ifa . . .dlp Kri . 

Moving allowable factors from the integral, 
P(x\n) - MfW^ [ TT j ke -i 

pww - r(ao)^r(a)» Ul Jfc ^ 



,fc=l 



V ^a +an k 1 exp {_^ fc ( aQUQ + a s gk )} 

} [ j dipt... dip Kv 



E 



k=l 



Jk 



where the integral is over the set E of decreasing if} k s, and where J k repre- 
sents any cluster-specific quantity which does not depend on ip k . Choosing 

r(a + an k ) 

Jk ~ 



(a ^o + as gM )^+ an k 

provides just the right normalization, because then the integrand becomes 
the joint density of independent gamma-distributed variables, with the fcth 
variable having shape a k = ao + otn k and rate X k = ao^o + as 9 ,fc- The 
integral itself, denoted -Pordl 7 ?); is the probability that independent gamma- 
distributed variables assume a certain order. The preceeding factor can be 
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arranged as products of the product statistics t 9t k multiplied by factors in- 
volving the sum statistics s g ^- After a bit of simplification the following 
result is established. 

Theorem 1. In the model defined above, the component density p(x g \rj) 
equals 



where the 's are mutually independent gamma-distributed random vari- 
ables with shapes a/t = ao + ank and rates \t = olqUq + as g ^, and where the 
normalizing constant is 



In |5|) ; P rd(v) = 1 for the null case involving K n = 1. 

The null structure r] = (12 ■ ■ - p) entails equal mean expression for all sam- 
ples; there is a single partition element, and K„ = 1. In this case the distri- 
bution in ([5]) is exchangeable and equals a multivariate compound gamma 
(Hutchinson, 1981). The positive parameters a and «o regulate within-group 
and among-group variation, and i/q is a scale parameter. Inspection also con- 
firms that if the random X = (X±, . . . , X n ) has density p(x\r)) in ([5]), and if 
b > 0, then Y = (bX%, . . . , bX n ) has a density of the same type, with shape 
parameters ao and a unchanged, but with scale parameter bvQ. 

Special cases of the density ^ have been reported: Newton et al. (2004) 
presented the case p = 2; Jensen et al. (2009) presented the case p = 4. 
See also Yuan and Kendziorski (2006a). Evidently an algorithm to compute 
-ford( r ?) is required in order to evaluate the component mixing densities. 
Beyond the p = 2 case, previous reports have evaluated these gamma-rank 
probabilities by Monte Carlo. 

Figure [l] displays contours of the three structured components when n = 2 
and p = 2. Clearly the components distribute mass quite differently from one 
another, and in a way that reflects constraints encoded by n. The densities 
from different structures r\ have the same support; the constraints restrict 
latent expected values rather than observables. In this way the approach 
shares something with generalized linear modeling wherein responses are 
modeled by generic exponential family densities and covariate information 
constrains the expected values (McCullagh and Nelder, 1989). 
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Fig 1. Three structured components in R 2 . Here a — 10, a?o = 3, and vq — 2 5 . Contours 
cover 50%, 80%, 95%, and 99% probability. For convenience each density is shown for 
log 2 -transformed pairs. 
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3. Gamma-rank probabilities. A statistical computing problem must 
be solved in order to implement gamma ranking. Specifically, it is required 
to calculate the probability P(E) of the event 

(6) E = {Zi > Z 2 > . . . > Z K } 

where {Z^ : k = 1,2, ... , K} are mutually independent gamma distributed 
random variables with possibly different shapes a±, a 2 , ■ ■ ■ , clk and rates 
Ai, A2, . . . , Afc. (Each P OT d(v) i n Section 2 is an instance of P{E).) In the 
special case K = 2, the event in two gamma-distributed variables is equiva- 
lent to the E' = {B > A2/(Ai + A2)}, where B is a Beta(ai,a2) distributed 
variable. Thus P(E) = P(E') can be computed by standard numerical ap- 
proaches for the Beta distribution. Although a similar representation is pos- 
sible for Dirichlet-distributed vectors when K > 2, a direct numerical ap- 
proach is not clearly indicated. In modeling permutation data, Stern (1990) 
presented a formula for P(E) for any value K, but assuming common shape 
parameters = a. Sobel and Frankowski (1994) calculated P(E) for K < 5 
and assuming constant rates A& = A, but to our knowledge a general formula 
has not been developed. A Monte Carlo approximation is certainly feasible, 
but a fast and accurate numerical approach would be preferable for compu- 
tational efficiency: target values may be small, and P(E) may need to be 
recomputed for many shape and rate settings. 

There is an efficient numerical approach to computing P(E) when shapes 
afc are positive integers. The approach involves embedding {Z^} in a collec- 
tion of independent Poisson processes {N&}, where k = 1, 2, . . . , K. Specifi- 
cally, let Nfc denote a Poisson process on (0,oo) with rate Afe. So Nfc(0,t] ~ 
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Poisson(iAfc), for example. Of course, gaps between points in r% are inde- 
pendent and exponentially distributed, and the gamma-distributed Z k can 
be constructed by summing the first a k gaps, 

Z k = m.m{t >0:N fc (0,t] > a k } . 

Next, form processes {M^} by accumulating points in the originating pro- 
cesses: M. k = ^j=i^j- Marginally is a Poisson process with rate A k = 
Y^j=i ^j' but over ^ the processes are dependent owing to overlapping points. 
To complete the construction define count random variables Mi , M2 , • • • , Mk- 
by 

(7) M k = M k (0, Z k+1 ] . 

It is immediate that each M k has a marginal negative binomial distribution: 
the gamma distributed Z k+ \ is independent of ~M k ; conditioning on Z k+ \ 
in gives a Poisson variable which mixes out to the negative binomial 
(Greenwood and Yule, 1920). Specifically, 

M k ~ NB (shape = a k+1 , scale = A k /X k+1 ) , 

which corresponds to the probability mass function 



T(m + a k+ i) ( A fc+i \ 



a k+l 



A 



k 



(8) Pk{m) " T(a k+1 )T{m + l)\K k+l J \A k+l 
for integers m > 0. The next main finding is 

Theorem 2. With E as in |6p, M k as in |?p, and p k as in M), P{E) 
equals 

ai— 1 mi+a2 — 1 m K-2+ a K-i — 1 

(9) ^2 ^2 '"■ Pi(^i)P2(m 2 ) •■ • Pif-i(mjf-i). 

mi=0 m,2=0 m^-_!=0 

It does not seem to be obvious that E in ^ is equivalent to an event in 
the {M k }. We also find it striking that the M k variables are independent 
considering that they are constructed from highly dependent M. k processes. 
Proof of ([9]) and the related distribution theory are presented in Appendix A. 

A redistribution of products and sums allows a numerically efficient evalu- 
ation of @, as in the sum-product algorithm (e.g., Kschischang et al. 2001). 
For instance with K = 4, 




"11+12—1 



(10) P{E) = i E V2{m 2 ) 

?T12=0 



"12+03— 1 

. "13=0 
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Here one would evaluate P(E) by first constructing for each 777,2 6 {0, 1, . . . , ai+ 
02 — 2} an inner sum P{M% < mi + 03 — 1). This vector in 777,2 values is 
used to process the second inner sum, for each value mi G {0, 1, . . . , a\ — 1}. 
Indeed the computation is completely analogous to the Baum- Welch back- 
ward recursion (e.g., Rabiner, 1989), although, interestingly, there seems to 
be no hidden Markov chain in the system. A version of the Viterbi algo- 
rithm identifies the maximal summand and thus provides an approach to 
computing \ogP(E) in case P{E) is very small. 

4. Linear independence. The component densities ^ seem to have 
the useful property of being linearly independent functions on M n . Linear 
independence of the component density functions is equivalent to identifia- 
bility of the mixture model (Yakowitz and Spragins, 1968). It is necessary 
for strict concavity of the log-likelihood, but it is not routinely established. 
Establishing identifiability also is a key step in determining sampling prop- 
erties of the maximum likelihood estimator. 

Let a = (o„) denote a vector of real numbers indexed by structures 77. 
Recall that the finite catalog of functions {p(x\r])} is linearly independent if 

T a (x) = a vP( x \ n ) = f° r an x implies a v = for all 77. 

v 

It is plausible that this property holds generally, but we have been able to 
establish a proof only in a special case. 

Theorem 3. In a balanced experiment where m replicate samples are 
measured in each of p = 2 or p = 3 groups, the component densities p(x g \n) 
in |5p are linearly independent functions on M. mp . 

A proof proceeds by finding a multivariate polynomial <p{x) > such that 
(j){x)T a {x) is itself a multivariate polynomial. A close study of the degrees 
and coefficients of this polynomial leads us to the result (Appendix B). 
That such a (j>{x) exists follows from ([5]): the center is a rational function, 
and the factor -P rd( 7 ?) is also rational, being a linear combination of rational 
functions, as established in Q. 

5. Data analysis considerations. 

5.1. Estimation. To deploy model ([l] [5]) requires the estimation of pa- 
rameters a, ao and vq, which are shared by the different components, as 
well as mixing proportions n = {tt^}, which link the components together. 
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Consider first the log likelihood for ir alone (treating the shared parameters 
as known) under independent and identically distributed sampling from ([!]) : 

(11) l(n) =J2 lo s{j2 7r vP^ 9 \v)) 

9=1 I V ) 

where G is the number of genes providing data. Maximum likelihood esti- 
mation of 7r is buttressed by the following finding. 

Theorem 4. Suppose that the component densities are linearly inde- 
pendent functions in the mixture of structured components model. If G is 



sufficiently large, then the log likelihood l(ir) in (11) is strictly concave on a 
convex domain, and thus admits a unique maximizer n = {n v }. This prop- 
erty is almost sure in data sets. 

The expectation-maximization (EM) algorithm naturally applies to ap- 
proximate 7T. By strict concavity of /(vr), it is not necessary to re-run EM 
from multiple starting points. The final estimate and resultant clustering 
should be insensitive to starting position, as has been found in numerical 
experiments. This is a convenient but unusual property in the domain of 
mixture-based clustering (McLachlan and Peel, 2000, pg 44). 

In a small simulation experiment we confirmed that our implementation of 
the EM algorithm was able to recover mixture proportions given sufficiently 
many draws from the marginal distribution ([I]) (data not shown). 

Full maximum likelihood for both the mixing proportions and shared pa- 
rameters is feasible via the EM algorithm, but this increases computational 
costs. In the prototype implementation used here, we fixed the shared param- 
eters at estimates obtained from a simpler mixture model, and then ran the 
EM algorithm to estimate the mixing proportions. Specifically, we used the 
gamma-gamma method in EBarrays (www.bioconductor.org), which corre- 
sponds to mixing as in 

but over the smaller set of unordered structures. 
Experiments indicated that this approximation had a small effect on the 
identified clusters (see Appendix D). 

Inference derived through the proposed parametric model is reliant to 
some degree on the validity of the governing assumptions. Quantile-quantile 
plots and plots relating sample coefficient-of-variation to sample mean pro- 
vide useful diagnostics for the gamma observation-component of the model. 
The within-component model is restrictive in the sense that three param- 
eters are shared among all the components (i.e., structures). This can be 
checked by making comparisons of inferred clusters, but only large clusters 
would deliver any power. Clusters reveal patterns in mean expression, while 
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the shared parameters have more to do with variability; if other domains of 
statistics provide a guide, one expects that misspecifying the variance may 
reduce some measure of efficiency without disabling the entire procedure. 
The ultimate issue is whether or not the clustering method usefully repre- 
sents any underlying biology. This is difficult to assess, though we examine 
the issue in a limited way in the examples considered next. 

5.2. Example. Edwards et al. (2003) studied the transcriptional response 
of mouse heart tissue to oxidative stress. Three biological replicate samples 
were measured using Affymetrix oligonucleotide arrays at each of five time 
points (baseline and one, three, five, and seven hours after a stress treat- 
ment) for several ages of mice. Considering the older mice for illustration, 
we havep = 5 distinct groups, n = 15 samples, and 10, 043 genes (i.e., probe 
sets, after pre-processing). Gene-specific moderated F-testing (Smyth, 2004) 
produced a list of G = 786 genes that exhibited a significant temporal re- 
sponse to stress at the 10% false discovery rate (by q-value; Storey and 
Tibshirani, 2003). Gamma ranking involved fitting the mixture of struc- 
tured components, which with p = 5 mixes over 540 distinct components. 
(Since we worked with significantly altered genes, we did not include the null 
component in which all means are equal; other aspects of model fitting and 
diagnosis are provided in Appendix D). From the catalog of 540 possibilities, 
genes populated 23 clusters by gamma ranking, though only four clusters 
contained 10 or more of the G = 786 stress-responding genes (Figure 2). 
Most expression changes occurred between baseline and the first time point, 
but 30 genes (red cluster) showed significant up-regulation at all but one 
time point, for example. 

Gamma ranking gave different results than K-means or mclust, which 
respectively found 20 and 2 clusters in Edwards' data. Here K was chosen 
according to guidelines in Hastie et al. (2001); mclust used the Bayes infor- 
mation criteria over the range from 1 to 50 clusters. Otherwise, both methods 
used default settings in the respective R functions (www.r-project.org). The 
adjusted Rand index (Hubert and Arabie, 1985), which measures dissimilar- 
ity of partitions, was 0.09 comparing gamma ranking and K-means, 0.16 for 
gamma ranking and mclust, while for K-means and mclust it was smaller, 
at 0.02. 

The biological significance of clusters identified by any algorithm may be 
worth investigating. For example, the cluster of 30 increasing expressors in- 
cludes 2 genes (Mgstl & Gsta4) from among only 17 in the whole genome 
that are involved in glutathione transferase activity. Understanding the in- 
creased activity of this molecular function will give a more complete picture 
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Fig 2. Dominant patterns of differential expression in time course data from Edwards 
et al. (2003). Each panel summarizes data from one cluster identified by gamma ranking 
(the nine largest clusters are shown). A digital code signifies the inferred ordering of the 
latent expected values (i.e., rj, in an alternative notation). Each gene is a single line trace; 
triplicate measurements were reduced by averaging and then standardized for display; raw 
data went into the model fitting. Results are based on 100 cycles of EM to estimate mixing 
proportions followed by Bayes 's rule assignment. 




14 NEWTON, CHUNG 

Fig 3. Characteristics of clusters from an empirical study of 11 data sets. 
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of the biology (e.g., Girardot et al. 2004). In isolation it is difficult to see 
how such investigation is supportive of a given clustering approach. The 
benefits become more apparent when we look at many data sets and many 
functional categories. 



5.3. Empirical Study. Gamma ranking was applied to a series of 11 data 
sets obtained from the Gene Expression Omnibus (GEO) repository (Edgar 
et al, 2002). These were all the data sets satisfying a specific and relevant 
query (Table [2]). They represent experiments on different organisms and 
they exhibit a range of variation characteristics. In each case we applied the 
moderated F test and selected genes with q- value no larger than 5%. Gamma 
ranking, and, for comparison, mclust and K-means, were applied in order 
to cluster genes separately for each data set. Basic facts about the identified 
clusters are reported in Table [2} Figure 3 shows that gamma ranking tends 
to produce smaller clusters than mclust and K-means, although it also has a 
wider size distribution; and there was a relatively low level of overlap among 
the three approaches. 

The empirical study shows not only that gamma ranking produces sub- 
stantially different clusters than popular approaches, but also that the iden- 
tified clusters are significant in terms of their biological properties. Investi- 
gators often measure the biological properties of a gene cluster by identifying 
functional properties that seem to be over-represented in the cluster. Gene 
set enrichment analysis is most frequently performed by applying Fisher's 
exact test to each of a long list of functional categories, testing the null hy- 
pothesis that the functional category is independent of the gene cluster (e.g., 
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Newton et ai, 2007). Functional categories from the Gene Ontology (GO) 
Consortium and the Kyoto encyclopedia (KEGG) were used to assess the 
biological properties of all the clusters identified in the above calculation. 
Specifically, we computed for each cluster a vector of p-values across GO 
and KEGG. Figure 4 shows the proportion of these p-values smaller than 
0.05, stratified by cluster size and in comparison to results on random sets of 
the same size. Evidently, the clusters identified by gamma ranking contain 
substantial biological information. 

Figure 4 also shows that mclust clusters carry substantial biological infor- 
mation, and a similar result is true for K-means (not shown). Whatever clus- 
ter signal is present in the expression data, it is evident that gamma-ranking 
finds different aspects of this signal than do the standard approaches, while 
still delivering clusters that relate in some way to the biology. Gamma- 
ranking clusters are not anonymous sets of genes with similar expression 
profiles; they are sets of genes linked to an ordering pattern in the underly- 
ing means. The commonly used clustering methods are unsupervised, while 
gamma-ranking utilizes the known grouping labels in the sample. It seems 
beneficial to use this grouping information; undoubtedly various schemes 
could be developed. By their construction the gamma-ranking clusters have 
a simple interpretation in terms of sets of genes supporting particular hy- 
potheses about changes in mean expression. 

6. Count data. Microarray technology naturally leads to continuous 
measurements of gene expression, as modeled in Section 2, but technological 
advances allow investigators essentially to count the number of copies of each 
molecule of interest in each sample (e.g., Mortazavi et al. 2008). Poisson 
distributions are central in the analysis of such data (e.g., Marioni et al. 
2008), and gamma ranking extends readily to this case. 

Briefly, data at each gene (or tag) is a vector x g = • • • , x gtTl ) as be- 
fore, but x 9t i is now a count from the ith library (rather than an expression 
level on the ith microarray). There may be replicate libraries within a given 
cellular state, and comparisons of interest may be between different cellular 
states. Library sizes {-/Vj}, say, are additional but known design parameters. 
Important parameters are expected counts relative to some common library 
size. Adopting the notation from Section 2, a cluster of libraries a(ij, k) may 
share their size-adjusted expected values, and so for any i 6 a(r], k) the ob- 
served count from the Poisson distribution with mean Ni/j, k . Fur- 
ther, the structure r\ on test puts an ordering constraint n\ < [12 ■ ■ ■ < Hk^ 
on these latent expectations. The key is to integrate away these latent ex- 
pected values using a conjugate gamma prior, but conditionally on the or- 
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Fig 4. Empirical study of the association between clusters and biological function. For every 
cluster identified by gamma ranking (red) or mclust (green) in the data sets in Table 2, 
plotted is the proportion of small enrichment p-values (vertical) versus the cluster size 
(horizontal) . The enrichment p-values are Fisher- exact-test p-values and the proportion is 
computed over a database of GO and KEGG pathways (Table 7). Bands indicate similar 
proportions computed for random sets. 
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dering. Prior to conditioning, the /U&'s are independent and identical gamma 
variables with (integer) shape ao and rate ao^o- Then, analogously to The- 
orem 2, the predictive distribution for the vector of conditionally Poisson 
responses is: 



(12) p(x g \r))=c v 



contor(»7) 

where 

Pord(v) =P[Zl <Z 2 <...< Z Kv ) 

with the Zk's mutually independent gamma-distributed random variables 



with (gene-specific) shapes a k = aQ + s g ^ and rates = aQVQ + n k . In (12), 
the normalizing constant is 



[r(a )] " k=1 



-ao 



and, further, s gM = Eieafo,*) x g,h n k = £ie«rfa,fc) N h and 

ii (^-Y' J . 

rGa(ri,k) 

Notice that in P rd( r ?) the event refers to an increasing sequence of gamma's, 
rather than a decreasing sequence, as in Theorem 1. This arises because for 
Poisson responses the conjugate prior involves a gamma distribution for the 
means, whereas for gamma responses the conjugate is inverse gamma on the 
means. For computations to work out the key thing is that some monotone 
transformation of each latent mean has a gamma distribution. In the null 



structure (all means equal), i-'ord( r /) = 1 an d (12) reduces to the negative 



multinomial distribution. It will be important to study the practical utility 



of (12) and overdispersed extensions (c/. Robinson and Smyth, 2007), but 
such investigation is not within the scope of the present paper. The main 
reason to present the finding here is to show that gamma-rank probabilities 
(Section 3) arise in multiple probability models. 

7. Concluding Remarks. Calculations presented here consider a dis- 
crete mixture model and the resulting clustering for gene-expression or sim- 
ilar data types. The discrete mixing is over patterns of equality and inequal- 
ity among latent expected values (ordered structures). Clustering by these 
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patterns addresses an important biological problem to organize gene expres- 
sion relative to various cellular states, which is part of the larger task to 
determine biological function. In examples the method was applied after a 
round of feature selection, although it could have been applied to each full 
data set (i.e., by including the null structure in the mix) and it could have 
been the basis of more comprehensive analysis, going beyond clustering and 
towards hypothesis testing and error-rate-controlled gene lists. Our more 
conservative line is attributable in part to an incomplete understanding of 
the method's robustness. Relaxing the fixed-coefficient-of-variation assump- 
tion, as in Lo and Gottardo (2007) or Rossell (2009), could be considered to 
address the problem. The focus on clustering, however, is motivated largely 
by its practical utility in the context of genomic data analysis. 

By cataloging ordered structures, rather than the smaller set of unordered 
structures, the mixture model produces readily interpretable clusters in 
the multi-group setting. Jensen et al. (2009) argues similarly. For exam- 
ple, the largest cluster of temporally responsive genes in Edwards' data are 
upregulated immediately after treatment and show no significant fluctua- 
tions thereafter. The development of calculations for ordered structures has 
been more challenging than for unordered structures, which were presented 
in Kendziorski et al. (2003) and implemented in the Bioconductor package 
EBarrays. Mixture calculations are simplified in the unordered case because 
component densities reduce by factorization to elementary products (i.e., the 
last factor in ^ is not present). The requirement to compute gamma-rank 
probabilities had limited a fuller development. 

Gamma ranking produces clusters indexed by patterns of expected ex- 
pression rather than anonymous clusters defined by high similarity of their 
contents. A referee noted that large gamma-ranking clusters may tend to 
swallow up genes more easily than small clusters because the estimated pos- 
terior assignment probability is proportional to the estimate of the mixing 
weight ir v : i.e. structures with large tt v have a head start in the allocation of 
genes. On one hand, this provides an efficiency which may be advantageous 
for genes that have a relatively weak signal (and which otherwise might be 
assigned to a more null- like structure). It also implies that small clusters 
are more reliable, in a way, since the assigned genes have made it in spite 
of the small tt„. Another feature of gamma ranking is that clusters can be 
tuned by a threshold parameter c, as in ([2]) , rather than being determined by 
Bayes's rule assignment. Taking c close to 1 tends to purify the clusters; the 
more equivocal genes drop into an unassigned category. Empirically, such 
swallowing up may not be substantial; at least in comparison to the simpler 
clustering methods analyzed, gamma ranking produces more and smaller 
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clusters. 

There is nothing explicit in gamma ranking that attends to the temporal 
dependence which might seem to be involved in time-course data. Indepen- 
dent cell lines were grown in the Edwards' experiment, one for each microar- 
ray, and so there is independent sampling in spite of a time component. 
Additionally, the model imposes dependencies (in [5]) driven by whichever 
structure r\ governs data at a given gene. If there were complicated tem- 
poral dependence, the identified clusters would still reflect genes that act 
in concert in this experiment; they might act in concert by a different r/ in 
another run of the experiment, and we would not be confident in the fitted 
proportions, even though the clusters may continue to be informative. Nei- 
ther does the model have explicit dependence among genes; but it produces 
clusters of genes that seem to be highly associated (genes that realize the 
same structure r/ seem to present correlated data). This shows that a suf- 
ficiently rich hierarchical model, based on lots of conditional independence, 
can represent characteristics of dependent data. Of course care is needed 
since the sampling distribution of parameter estimates is affected by the 
intrinsic dependencies within the data generating mechanism. 

The mixture framework from Kendziorski et al. (2003) has supported a 
number of extensions to related problems in statistical genomics: Yuan and 
Kendziorski (2006b) (time-course data), Kendziorski et al. (2006) (mapping 
expression traits) and Keles (2007) (localizing transcription factors). The 
ability to monitor ordered structures may have some application in these 
problems. Further, the ability to compute gamma-rank probabilities may 
have application in distinct inference problems (e.g., Doksum and Ozeki, 
2008). Future work includes developing a better software implementation of 
gamma ranking, enabling the implementation to have additional flexibility 
(e.g., gene-specific shapes a), studying the method's sampling properties, 
and exploring extensions to emerging data sources. 
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Table 2 

Summary of 11 data sets from the Gene Expression Omnibus (GEO). GDS is the GEO 
data set accession number. These sets satisfied the search query from August 2008 having 

subset variable type time or development stage or age and having a single factor with 
three to eight levels, p indicates the number of groups and n is the number of samples. G 

indicates the number of genes deemed significantly altered by one-way moderated F test 

and 0.05 FDR (limma). The remaining columns show how many clusters are found by 
gamma ranking with 100 EM iterations (GR), mclust (MC) and K-means (KM). 



GDS 


citation 


organism 


P 


n 


G 


GR 


MC 


KM 


2323 


Coser et al. 


Homo Sapiens 


3 


9 


1409 


11 


5 


13 


1802 


Tabuchi et al. 


Mus musculus 


4 


8 


3433 


49 


7 


10 


2043 


Tabuchi et al. 


Mus musculus 


4 


8 


3001 


51 


8 


18 


2360 


Ron et al. 


Mus musculus 


4 


9 


8714 


50 


8 


30 


599 


Vemula et al. 


Rattus norvegicus 


5 


10 


673 


42 


2 


40 


812 


Zeng et al. 


Mus musculus 


5 


17 


10982 


135 


7 


15 


1937 


Pilot et al. 


Drosophila 


5 


15 


7733 


88 


8 


10 


568 


Welch et al. 


Mus musculus 


6 


18 


3737 


134 


4 


25 


2431 


Keller et al. 


Homo Sapiens 


6 


18 


8505 


137 


9 


12 


587 


Tomczak et al. 


Mus musculus 


7 


21 


860 


50 


2 


20 


586 


Tomczak et al. 


Mus musculus 


8 


24 


5211 


118 


5 
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Appendix A. Proof of Theorem 2: Let gk(z) denote the density of a 
gamma distribution with shape ak and rate A&. By definition 



P(E) 



OO /"OO 



ZK 



' Z2 



K 



k=l 



dzi • • • dzx-i dzx 



poo poo 1*00 

/ 9k{zk) 9k-i(zk-i) ■ ■ ■ gi{zi)dzi ■■■ dz K _idz K 

JO J Zk J 22 



where in the second line we move factors in the integrand as far as possible 
to the left. With this in mind we construct functions fk(z), z > 0, recursively 
as fo(z) = 1 and, for k = 1, 2, . . . , K, 



(13) 



/•oo 

fk(z)= fk-i(u) 9k(u) du, 

J z 



and we observe that P(E) = /r-(O). Evaluating these functions further, we 
see 

/oo 

= P(Z 1 >z) 

= P[M x <a x \Z 2 = z) 
01— 1 

mi=0 

Here Mi = Mi(0, Z2) is Poisson(Aiz) distributed conditionally upon Z2 = z, 
and po() indicates the Poisson probability mass function with the indicated 
parameter. The equivalence in the second and third lines above stems from 
basic relationships between objects in the underlying Poisson processes. As 
long as Mi is small, it means that the f% process has not accumulated many 
points up to time Zi = z, and hence the Z\ value must be relatively large. 
More basically, 



(14) 



P(U >u) = P(X < a) 



when U ~ Gamma(a, A) and X ~ Poisson(Au), for integer shapes a. 
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Proceeding to f2(z), 

h{z) = I fi(z 2 ) g 2 (z 2 ) dz 2 



a l J- poo 

^ J po(mi;AiJ2r 2 ) 52(^2) dz 2 



mi=0 

= V Pl ( mi ) r po^jAiza); ^(za) 

Here pi(m{) is the probability mass function of a negative-binomial distribu- 
tion, as in Q. Indeed we have reorganized the summand above to highlight 
that integrand on the far right is precisely the density function of a gamma 
distributed variable with shape a 2 + m\ and rate Ai + \ 2 . This represents 
the Poisson-Gamma conjugacy in ordinary Bayesian analysis (e.g., Gelman 
et al. 2004, pages 52-53). The integral evaluates to 1 if z = 0, and hence 
we have proved the case K = 2. But furthermore the integral represents the 
chance that a gamma distributed variable is large, and so by ( 14 ) 

Oi— 1 mi+a2 — l 

f2(z) = ^ Pl( m l)P°( m 2;( A l + A 2)20 

mi=0 m.2=0 
a\ — 1 mi+a2 — 1 

= ^ ^ pi(mi)po(m 2 ; A 2 z) 

mi=0 m2=0 

The baseline result of an induction proof has been established. Assume that 
for some k > 3, 

ai-l m fe _2+afc-i-l / k-2 \ 

(15)/ fc _i(z) = ^ ••• ^ I JJp 3 -(m 3 -) J po(m fc _i;Afc_i^) 

mi=0 m fc _i=0 \i=l / 
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and then evaluate ( 13 ) to obtain 

/>oo 

fk(z) = / fk-i(zk) 9k(z k ) dz k 



oo ai-1 m fe _ 2 +a fc _i-l /fc-2 
m k _ 1 =0 \j=l 



mi=0 



ai-1 mfc-2+afc-i-l /fc-2 



(<! J. " v /c — ^ i ""ft; — ± # A/ _ » 

£■■• E II- / 

~ n ™_ — n \ 1 / " £ 



mi=0 



m fe _!=0 \j=l 



ai-1 m fc _2+a fe „ 1 -l /fc-l 

£•■■ E 1I^<", 



mi=0 



m fe _!=0 \i=l 



po(mfc_i; Kk-iZk) gk{zk) dz k 
' po(m k -i; Ak-iZk) gk(zk) 



ai -l m fc _2+a fc _ 1 -l /fc_l 

E- E II /^<",: 



mi=0 



m fe _i=0 \j=l 



p k -i(m k -i 



m fc _ 1 +a fe -l 

E po(m fc ;A/ c z) 

m fc =0 



ai-1 m fc _!+a fe -l / 



mi=0 



E II^'( m j) I p°( m fc; A fc z ) 

m k =0 \i=l 



which establishes that (15) is true for all k. Evaluating at k = K and z = 
establishes the theorem. □ 



Coda: Further insight is gained by realizing from the definition of the counts 
that 

Mk(Zk) = Uk-i(Z k ) + a k 
= M k -i+a k . 

But also Mfc has a jump at Z k , and so we see the equivalence 

(16) Z k > Z k+ i M k < M k -i + a k . 

The event E is an intersection of these pairwise events, and this is man- 
ifested in the ranges of summation in Q. In contrast to ([9]), these event 
considerations give P{E) equal to 

ai— 1 mi+a 2 -l mK-2+aK-i— 1 

( 17 ) EE'" E Pjoint(mi,m 2 ,- •• ,mK-i)- 

mi=0 m2=0 mj{_ 1 =0 

The implication seems to be that M\, M2, • • ■ , M^-i are mutually indepen- 
dent, though Theorem 1 does not confirm this because the factorization into 
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negative binomials is required for all arguments, beyond what is shown. It is 
a conjecture that the {M/J are mutually independent. A proof by brute force 
evaluation in the special cases K = 3 and K = 4 is available (not shown), 
but we have not found a general proof. The fact is somewhat surprising 
because the {M^} processes are highly positively dependent. The indepen- 
dence seems to emerge as a balance between this positive dependence and 
the negative association created by being inversely related to Mj.(0, t]. □ 
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Appendix B. Linear independence and proof of Theorem 3. Consider 
the three-dimensional case, and initially consider a single replicate in each 
of the three groups. Data on each gene form the vector (x,y,z), say, of 
three positive reals. Thirteen component densities p(x, y, z\rf) constitute the 
mixture model (Table [3]). For a vector a = (a n ) of reals, the test function is 
T a (x,y,z) = ^2 rj a ri p{x^y, z\rj). It needs to be shown that if T a (x,y,z) = 
for all x,y, z > 0, then a v = for all structures rj. Specializing ^ to this 
case, and eliminating the positive factor (xyz) 0-1 , we see that T a (x, y, z) = 
is equivalent to 

(18) ^ a v c v center^) P ord (??) = 0. 

v 



Table 3 

Thirteen structured components p(x, y, z\rj) = c v (xyz) a ~ 1 center(ri)P OI d(rj) in the 
three-dimensional, no-replicate case. The forms have been simplified, WLOG, by taking 
the scale i/q = 1, by writing f3 — cto + a and £ = eta /a. Normalizing constants c v are as 
in |^]). Note that the e m and e m , n stand for constants (not involving x,y,z), but possibly 

differing among rows. 



structure r\ 


[center(r7)] 1 


Pord(ri) 


(123) 


(x + y + z + ^ +2a 








1 


(12)(3) 


(z + y + + 






Z-<m=0 


-1 e m (z + e)' 3 ( ;I: + ! / + C) m 
( a:+ j /+z+ 2g)/3+ m 


(3)(12) 


it 






V»/9-l 
Zjm = 


e m (z + O m (a: + !/+e) /3 + a 








( a: + ;j + z + 2l;)< 3 + < *+ m 


(13)(2) 


(x + z + ^+«(y + O 






v-^/3+a- 

Z^ 77l= o 


-1 e m (y+0"(* + z + O m 
( a:+!/+z+ 2e)^ + "> 


(2)(13) 








Zum=0 










( x + y + z + 2£)P + <x + rn. 


(1)(23) 


{y + z + iY+ a {x + tf 






Z-(ni=0 


em(i+£) m (9+2+«)' i+0 






(iE+i/ + z + 2£)' 3 + Q + " 1 


(23)(1) 








Z-<m=0 


-1 e m (x+O f '(y + 2 + O m 
(x+y + z+2£)P + ™ 


(1)(2)(3) 








m+P—1 e 


r 1 ,„(a:+5) m [(H + C)(z + e)]' 3 (a: + H + 2e)" 


n=0 


( a:+ j / + 25)/3 + m ( ;z . + H + z + 3C )/3 + r 1 


(2)(1)(3) 




Tf- 


is 


m+/3-l e 


m ,r,(!/+«) m [(^+«)(z+«)]' 3 (^ + !/+2?)" 




n=0 


(x+y + 2t)P + ™(x+y + z + 3t)P + ™ 


(1)(3)(2) 


a 


Tf~ 




m + /3-l e 


m, n [x + i) m [(y + i i )(z + l i )f(x + z + 2( i ) n 




n = 


(x + z + 2Z)P + ™(x+y + z + 3Z)P + n 


(2)(3)(1) 


u 


Tf' 


is 


m+P-1 e 


m ,»(!/+0'"^+fl(^0] /i (!/+2+2f)™ 




n=0 


(y + z + 2l;)/3 + ™( ;C + 2/ + Z + 30< 3 + ™ 


(3)(1)(2) 


U 


Tf- 


ioE 


m + /3-l e 


m ,r l (z+i) m [(x + ^(y + ^f(x + z + 2ir 




n = 


(x + z + 2Z)t ) + ™(x+y + z + ZZ)f i + ™ 


(3)(2)(1) 




Tf- 




m+/3-l e 


m , re (z+e) m [(^+€)(a+«)]' 3 (i/+2+2€)" 




n=0 


(y + z + 2l;)< 9 + m (: C + i, + Z + ;iO' 3 + ' 1 



A strictly positive multivariate polynomial <j>(x, y, z) is required that can 



convert the left-hand-side of (18) into a polynomial by the canceling of de- 
nominator factors. Specifically, 4> = <j)x4>2 where (j>i(x,y,z) controls factors 
in center^) and (p2(x,y,z) controls factors in P OI d(r])- Inspection suggests 
taking (pi(x,y,z) equal to 

(x + y + z + 0? +2a [(x + y + 0(x + z + t)(y + z + £)f +a [(* + 0(v + 0(* + Of 
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and 4>2(x, y, z) equal to 

(x + y + z + 20 W+a ~ 1 {{x + y + 2£)(x + z + 2£)(y + z + 2^~ 1 
x(x + y + z + 30 313 ' 2 

Observe that the degree of x in the polynomial 4> = 4>i4>2 is 13/3 + 5a — 5. 
Indeed this is also the degree of y and the degree of z by symmetry. These 
degrees are reduced in the polynomial f v = (ft(x, y, z) center^) P OT d(v) by 
factors in the denominators of center (ry) and P rd(v)- For example, if r] = 
(12) (3), then 

/„ = ( x + y + z + ^ +2a [(x + z + 0(y + z + t)f +a l(x + 0(y + Of 

x [(x + y + 2£)(x + z + 2£)(y + z + 2£)] 2/3 ~ 1 (x + y + z + 3£) 3/3 ~ 2 

X J] em ( z + ^^ x + y + ^rn^ x + y + z + 2 ^ + a-l- m ^ 
m=0 

which is a polynomial of degree 11/3 + 4a — 5, in both x and y, and of degree 
12/3+5a — 5 in z. A similar construction is possible for all structures; Table|4] 
records the degrees of x, y, and z in all component polynomials f v . 

Table 4 

Degree of x,y, and z in the multivariate polynomials f v = 4>(x, y, z) center(ry) P rd(v)- 
Recall /3 — cto + a and both a and cto are positive integers. 



structure rj 


degree(:r) 


degree(j/) 


degree (z) 


(123) 


11/3 + 4a - 


4 


11/3 + 4a - 


4 


11/3 + 4a- 


4 


(12)(3) 


11/3 + 4a- 


5 


11/3 + 4a- 


5 


12/3 + 5a - 


5 


(3)(12) 


12/3 + 4a - 


5 


12/3 + 4a - 


5 


11/3 + 4a - 


5 


(13)(2) 


11/3 + 4a- 


5 


12/3 + 5a - 


5 


11/3 + 4a- 


5 


(2)(13) 


12/3 + 4a - 


5 


11/3 + 4a- 


5 


12/3 + 4a - 


5 


(1)(23) 


11/3 + 4a- 


5 


12/3 + 4a - 


5 


12/3 + 4a - 


5 


(23)(1) 


12/3 + 5a - 


5 


11/3 + 4a - 


5 


11/3 + 4a- 


5 


(1)(2)(3) 


10/3 + 5a - 


5 


11/3 + 5a- 


5 


12/3 + 5a - 


5 


(2)(1)(3) 


11/3 + 5a- 


5 


10/3 + 5a - 


5 


12/3 + 5a - 


5 


(1)(3)(2) 


10/3 + 5a - 


5 


12/3 + 5a - 


5 


11/3 + 5a - 


5 


(2)(3)(1) 


12/3 + Sa- 


5 


10/3 + 5a - 


5 


11/3 + 5a - 


5 


(3)(1)(2) 


il^ 5a- 


5 


12^ + 5a - 


5 


10/3 + 5a - 


5 


(3)(2)(1) 


12/3 + 5a - 


5 


11,5 + 5a- 


5 


10/3 + 5a - 


5 



Having introduced the multiplier <p, the linear independence (18) is equiv- 
alent to the assertion that polynomial equation 



(19) 



Crj f n (x, y, z) = for all x, y, z > 
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implies a v = for all rj. Fixing any y and z, the left-hand-side of equa- 
tion ( |19| ) is a polynomial in x, with degree 12/3 + 5a — 5, according to 
Table T4j Indeed terms associated with structures rj = (23)(1), (2)(3)(1), 
and (3)(2)(1) all contribute monomials with that highest power in x. The 
coefficient of x 12/3+5a ~ 5 , denoted d = d(a,y,z), equals 

a(23)(l)C(23)(l)/(23)(l) + «(2)(3)(1)C(2)(3)(1)/(2)(3)(1) + a (3)(2)(l)C( 3 )(2)(l)/(2)(3)(l) 

where /' indicates contributions from respective terms within /„. This co- 
efficient d must equal zero, for all y and z; after all, a degree 12/3 + 5a — 5 
polynomial can equal zero in x for at most that many x values, unless the 
coefficient d is exactly zero; and we are asking that it equal zero at all values 
of x. From this study of the high-power coefficient in x, we have reduced 
consideration to three structures and are able to focus on d = d(a, y, z) as a 
bivariate polynomial in y and z (Table [5]). 

Table 5 

Degrees of y, and z in three terms of the bivariate polynomial d(a,y,z). This is a subset 

of Tabled 



structure r; 


degree(y) 


degree(z) 


(23)(1) 

(2) (3)(1) 

(3) (2)(1) 


11/3 + 4a -5 
10/3 + 5a - 5 
11/3 + 5a - 5 


11/3 + 4a - 5 
11/3 + 5a -5 
10,5 + 5a - 5 



The initial argument focusing on the degree of x can be adapted to study 
other variables in Table [5} With degree of y equal to 11/3 + 5a — 5, for 
instance, it must be that the coefficient d'(z), say, of y ll P+ 5a - 5 equals zero 
for all z. After all, the polynomial can equal zero at at most 11/3 + 5a — 5 
y values, and we require it to be zero at all y. But all contributions to that 
coefficient are strictly positive, except possibly a( 3 )( 2 )(i), hence we conclude 
a (3)(2)(i) = 0- By the same token, working with the degree 11/3 + 5a — 5 term 
in z, it follows that a(2)( 3 )(i) = 0, which then forces ci(2 3 )(i) = 0, because 
we require d = overall. Three rows from Table [4] have been eliminated 
(i.e., forced a n = 0), all those in which the mean of the first variable is 
greater than the other two means. Next, return to the reduced table, and 
focus, say, on structures (13)(2), (1)(3)(2), and (3)(1)(2), in which the second 
variable has mean greater than the others. In doing so, three more coefficients 
a (i3)(2) = a (i)(3)(2) = a ( 3 )(2)(i) = are forced, and Table 2 is further reduced 
to seven rows. Then the argument is repeated to get a(i2)( 3 ) = a (i)(2)( 3 ) = 
a (2)(i)( 3 ) = 0) an d it remains to assess coefficients a v of the four structures 
in Table H 
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Table 6 
Final subtable 



structure r/ 


degree(:r) 


degree ( y) 


degree (z) 


(123) 
(3)(12) 
(2)(13) 
(1)(23) 


11/3 + 4a -4 
12/3 + 4a - 5 
12/3 + 4a - 5 
11/3 + 4a -5 


11/3 + 4a -4 
12/3 + 4a - 5 
11/3 + 4a -5 
12/3 + 4a - 5 


11/3 + 4a -4 
11/3 + 4a -5 
12/3 + 4a - 5 
12/3 + 4a - 5 



The argument is repeated in this domain, knowing that all but four terms 
in ( 19 ) have been eliminated. The degree of x is 12/3 + 4a — 5, and there are 
contributions from both r/ = (3)(12) and 77 = (2)(13). But then restricted 
to these rows we get ar 3 vi 2 ) = because the coefficient of x 12/3+Aa ~ 5 as a 
polynomial in y has degree 12/3 + 4a — 5. The remaining constants a v are 
similarly zero, completing the proof in the no-replicate (to = 1), three group 
(p = 3) case. 

The balanced three group case follows suit, noting that now x, y, and z 
are sums taken, respectively, across replicates in each of the three groups. 
The product statistic is not xyz, but anyway it is common to all components 
and thus cancels in the linear combination test function. The observation- 
related shape parameter a is replaced by ma. The two-dimensional (p = 2) 
case is simpler and is left as an exercise. □ 
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Appendix C. Strict concavity of log-likelihood and proof of Theorem 4- 
Let q denote the number of non-null structures, and consider the log-likelihood 
I (it) in ( |11[ ) to be on M 9 , with the null probability defined secondarily as 
ttq = 1 — YLn^m^v This way we need not invoke Lagrange multipliers to 
compute derivatives of By calculus, the q x q Hessian H of negative 
2nd derivatives of l(jr) has (ij)th entry 



H *3 ~ Yl 



\p{Xg\m) -p(%|?7o)] \p(Xg\Vj) ~ P(Xg\r]o)] 



\p(Xg)] 2 

fi( x g) fjV- 1 <! 



where p{x g ) is the marginal density obtained by mixing over structures, as 
in ([!]), and fi(x) = [p{x\r]i) — p(x\rjo)] /p{x). Now let a = (a v ) be a q— vector 
of constants. To determine curvature of the log-likelihood we consider the 
quadratic form 

q q 

a T Ha = ^2Y j a i a j ^2fi(x g )f j {x g ) 

i=l j = l g 

g \i=i ) 

= E^)] 2 

g 

where T a (x) = Yli=i a ifi( x )- Clearly a T Ha > regardless of a, and so H 
is non- negative definite and l(ir) is concave. To establish strict concavity 
requires that we show T a {x g ) = for all g if and only if a = 0. The following 
lemma shows that knowing T a {x g ) = for all G values x g is enough to force 
T a (x) = for all x, as long as G is sufficiently large. But then a = by the 
linear independence assumption, completing the proof. 

Lemma 1. Let ip(x) be a multivariate polynomial in x E W 1 , and let 
Xi,X%, . . . ,X m denote a random sample from a continuous distribution on 
W 1 . If m is at least as large as the number of monomials in tp, then, with 
probability one, ^{Xi) = for i = 1,2, ... , m implies ip{x) = for all x. 

Proof of Lemma 1. Every point Xi puts a linear condition on the space of 
coefficients of ip. It needs to be verified that these conditions are linearly 
independent. Suppose that the first k conditions are linearly independent, 
so the space of ?/>'s that are zero at X\, . . . ,Xk has dimension (number of 
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monomials in if)) minus k. Pick one such nonzero polynomial, and call it 
(j). Since <j> = is a set with positive codimension, we may assume (with 
probability one) that 0(X^ + i) is not zero. Then, if we impose the additional 
condition ip(Xk+i) = 0, the dimension of the solution space drops by at 
least one, hence it drops by one. Letting k increase from 1 torn completes 
the proof. □ 

Appendix D. Further details of numerical examples. The parameters 
a,«0i and v o were fixed at values obtained by first fitting the unordered 
gamma-gamma mixture model in EBarrays, without a null structure but 
otherwise allowing all possible unordered structures. Shape parameters were 
then rounded to the nearest positive integer (Table 7) and all three parame- 
ters were plugged into the EM procedure to fit the proposed mixture-model 
proportions. (Recall that -P rd(??) in [5] can be computed only for integer 
shapes, hence the rounding.) To simplify EM calculations in the four ex- 
amples having more than five groups, the full set of ordered structures was 
filtered to a reduced set based on the fitting of the unordered gamma-gamma 
model in EBarrays. Each ordered structure corresponds to exactly one un- 
ordered structure (a many to one mapping). If no gene had a high (greater 
than 0.5) probability of mapping to a given unordered structure, then we 
deemed all corresponding ordered structures to have 7r„ = 0. This approxi- 
mation is not ideal, since the Bayes's rule assignment for some genes may be 
one of the structures eliminated by forcing tt„ = 0. This affects only 29 genes 
out of the 17539 clustered in these four cases. It would not affect clustering 
by a high threshold. 

For all data sets we examined quantile-quantile plots and plots relating 
sample coefficient of variation to sample mean. Some model violations were 
noted, but largely the gamma observation model was supported. 

For the Edwards data we re-ran the EM algorithm for 10 cycles and up- 
dated shape parameter estimates via 2-d grid search in each cycle. Estimated 
shapes changed slightly; 784/786 genes received the same Bayes's rule cluster 
assignment. 

Computations were done in R on industry-standard linux machines. For 
the data sets analyzed, run times ranged from 6 to 860 CPU seconds per EM 
iteration, with a mean of 270 seconds. Run time is affected by the number 
of genes analyzed, the number of groups, and also the shape parameters and 
sample sizes. 
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Table 7 

Parameter estimates (not including mixing proportions) from the examples analyzed. The 
last column indicates the number of functional categories in GO and KEGG having at 
least five annotated genes, which were used in the development of Figure 4- KEGG was 
not available for GDS1937, and so this data set was not used in Figure 4- 



data set 


a 


Q() 




# GO/KEGG 


Edwards 


113 


1 


586.5 




GDS2323 


14 


1 


119.1 


3849/184 


GDS1802 


17 


1 


46.8 


3619/182 


GDS2043 


22 


1 


47.7 


3619/182 


GDS2360 


8 


1 


15.8 


3258/175 


GDS599 


12 


1 


0.01 


3180/159 


GDS812 


5 


1 


15.4 


3258/175 


GDS1937 


6 


1 


20.5 


NA 


GDS568 


10 


1 


37.1 


3258/175 


GDS2431 


4 


1 


67.6 


4085/188 


GDS587 


8 


1 


9999.2 


1876/127 


GDS586 


13 


1 


4566.2 


3258/175 



ported in part by grants R01 ES017400 and T32 GM074904 from the Na- 
tional Institutes of Health. 

References. 

[1] Campbell, E.A., O'Hara, L., Catalano, R.D., Sharkey, A.M., Freeman, T.C., 
AND Johnson, M.H. (2006). Temporal expression profiling of the uterine luminal 
epithelium of the pseudo-pregnant mouse suggests receptivity to the fertilized egg is 
associated with complex transcriptional changes. Human Reproduction, 21, 2495-2513. 

[2] Dahl, D.B. and Newton, M.A. (2007). Multiple hypothesis testing by clustering 
treatment effects. Journal of the American Statistical Association, 102, 517-526. 

[3] Dennis, B. and Patil, G.P. (1984). The gamma distribution and weighted multi- 
modal gamma distributions as models of population abundance. Mathamatical Bio- 
sciences, 68, 187-212. 

[4] Dorsum, K.A. and Ozeki, A. (2008). Semiparametric models and likelihood-the 
power of ranks. In, Proceedings of the Third Erich L. Lehmann Symposium. 

[5] Edgar, R., Domrachev, M., and Lash, A.E. (2002). Gene expression omnibus: 
NCBI gene expression and hybridization array data repository, Nucleic Acids Research, 
30, 207-210. 

[6] Edwards, M., Sarkar, D., Klopp, R., Morrow, J., Weindruch, R., and Pro- 
lla, T. (2003). Age-related impairment of the transcriptional response to oxidative 
stress in the mouse heart. Physiological Genomics, 13, 119-127. 

[7] Eisen, MB, Spellman, PT, Brown, PO, and Botstein, D (1998). Cluster analysis 
and display of genome-wide expression patterns. Proceedings of the National Academy 
of Sciences. 95, 14863-14868. 

[8] Fraley, C, and Raftery, A.E. (2002). Model-based clustering, discriminant anal- 
ysis, and density estimation. Journal of the American Statistical Association 97, 611- 
631. 

[9] Gelman, A., Carlin, J.B., Stern, H.S., and Rubin, D.B. (2004). Bayesian data 
analysis. 2nd edition. Chapman and Hall. 



32 NEWTON, CHUNG 



[10; 

[ii 

[12; 

[13 
[14 

[15; 
[16 
[17 

[18 
[19 

[20 

[21 

[22; 
[23 

[24 

[25; 

[26 
[27 

[28 
[29 
[30 



Girardot, F., Monnier, V., Tricoire, H. (2004). Genome wide analysis of com- 
mon and specific stress responses in adult drosophila melanogaster. BMC Genomics, 
5, 74. 

Grasso,L.C, Maindonald, J., Rudd, S., Hayward, D.C., Saint, R., Miller, 
D.J., AND Ball, E.E. (2008). Microarray analysis identifies candidate genes for key 
roles in coral development. BMC Genomics, 9, 540. 

Hastie, T., Tibshirani, R., and Friedman, J. (2001). The elements of statistical 
learning. Springer- Verlag, New York. 

Hubert, L. and Arabie, P. (1985). Comparing partitions. Journal of Classification, 
2, 193-218. 

Greenwood, M. and Yule, G.U. (1920). An inquiry into the nature of frequency 
distributions representative of multiple happenings with particular reference to the 
occurrence of multiple attacks of disease or of repeated accidents. Journal of the Royal 
Statistical Society, 83, 255-279. 

Holzmann, H., Munk, A., and Gneiting, M. (2006). Identifiability of finite mix- 
tures of elliptical distributions. Scandinavian Journal of Statistics, 33, 753-763. 

Hutchinson, T.P. (1981). Compound gamma bivariate distributions. Metrika, 28, 
263-271. 

Jensen, S.T., Erkan, I., Arnardottir, E.S., and Small, D.S. (2009). Bayesian 
testing of many hypotheses x many genes: a study of sleep apnea. Annals of Applied 
Statistics, to appear. 

Keles, S. (2007). Mixture modeling for genome- wide localization of transcription 
factors. Biometrics, 63, 10-21. 

Kendziorski, CM., Newton, M.A., Lan, H., and Gould, M.N. (2003). On para- 
metric empirical Bayes methods for comparing multiple groups using replicated gene 
expression profiles. Statistics in Medicine, 22, 3899-3914. 

Kendziorski, CM., Chen, M., Yuan, M. Lan, H., and Attie, A.D. (2006). 
Statistical methods for expression quantitative trait loci (eQTL) mapping. Biometrics, 
62, 19-27. 

Kschischang, R., Frey, B.J., and Loeliger, H.A. (2001). Factor graphs and the 
sum-product algorithm. IEEE Transactions on information theory, 47, 498-519. 

Lo, K. AND Gottardo, R (2007). Flexible empirical Bayes models for differential 
gene expression. Bioinformatics, 23, 328-335. 

Marioni, J.C, Mason, C.E., Mane, S.M., Stephens, M., and Gilad, Y. (2008). 
RNA-seq: An assessment of technical reproducibility and comparison with gene ex- 
pression arrays. Genome Research, 18, 1509-1517. 

McCullagh, P. AND Nelder, J. A. (1989). Generalized linear models, 2nd edition. 
Chapman and Hall. 

McLachlan, G.J. AND Basford, K.E. 1988. Mixture models: inference and appli- 
cations to clustering. Marcel Dekker, New York. 

McLachlan, G.J. and Peel, D. (2000). Finite mixture models. Wiley, New York. 

Medvedovic, M., Yeung, K.Y., and Bumgarner, R.E. (2004). Bayesian mixture 
model based clustering of replicated microarray data. Bioinformatics, 20, 1222-1232. 

Mortazavi, A., Williams, B.A., McCue, K., Schaeffer, L. and Wold, B. 
Mapping and quantifying mammalian transcriptomes by RNA-Seq. Nature Methods, 
5, 621-628. 

Newton, M.A., Noueiry, A., Sarkar, D., and Ahlquist P. (2004). Detecting 
differential gene expression with a semiparametric hierarchical mixture method. Bio- 
statistics, 5, 155-176. 

Newton, M.A., Quintana, F.A., den Boon, J. A., Sengupta, S., and Ahlquist, 



GAMMA RANKING 



33 



P. (2007). Random-set methods identify distinct aspects of the enrichment signal in 

gene-set analysis. Annals of Applied Statistics, 1, 85-106. 
[31] Parmigiani, G., Garett, E.S., Irizarry, R.A., Zeger, S.L. (eds) (2003). The 

analysts of gene expression data: methods and software Springer. 
[32] Rabiner, L.R. (1989). A tutorial on hidden markov models and selected applications 

in speech recognition. Proceedings of the IEEE, 77, 257-286. 
[33] Redner, R.A. and Walker, H.F. (1984). Mixture densities, maximum likelihood 

and the EM algorithm. SIAM Rev. 26, 195-239. 
[34] Rempala, G.A. and Pawlikowska, I. (2008). Limit theorems for hybridization 

reactions on oligonucleotide microarrays. Journal of Multivariate Analysis, 99, 2082- 

2095. 

[35] Robinson, M.D. and Smyth, G.K. (2007). Moderated statistical tests for assessing 
differences in tag abundance. Biomformatics, 23, 2881-2887. 

[36] Rossell, D. (2009). GaGa: a parsimonious and flexible model for high-throughput 
data analysis Annals of Applied Statistics, 3, in press. 

[37] Smyth, G.K. (2004). Linear models and empirical Bayes methods for assessing differ- 
ential expression in microarray experiments. Statistical Applications in Genetics and 
Molecular Biology, 3, no 1, article 3. 

[38] Sobel, M. AND Frankowski, K. (1994). The 500th anniversary of the sharing prob- 
lem (the oldest problem in the theory of probability) . American Mathematical Monthly, 
101, 833-847. 

[39] Speed, T. (ed) (2004). Statistical analysis of gene expression microarray data. Chap- 
man and Hall/CRC. 

[40] Stephens, M. (2000). Dealing with label switching in mixture models. Journal of 
the Royal Statistical Society, Series B, 62, 795-809. 

[41] Stern, H. (1990). Models for distributions on permutations. Journal of the American 
Statistical Association, 85, 558-564. 

[42] Storey, J.D. and Tibshirani, R. (2003). Statistical significance for genome-wide 
studies. Proceedings of the National Academy of Sciences, 100, 9440-9445. 

[43] Thalamuthu, A., Mukhapadhyay, I., Zheng, X., and Tseng, G.C. (2006). Eval- 
uation and comparison of gene clustering methods in microarray analysis. Bioinfor- 
matics, 22, 2405-2412. 

[44] Titterington, D.M., Smith, A.F.M., AND Makov, U.E. (1985). Statistical anal- 
ysis of finite mixture distributions. Wiley, New York. 

[45] Yakowitz, S.J. AND Spragins, J.D. (1968). On the identifiability of finite mixtures. 
Annals of Mathematical Statistics, 39, 209-214. 

[46] Yuan, M. and Kendziorski, C.M. (2006a). A unified approach for simultaneous 
gene clustering and differential expression identification. Biometrics, 62, 1089-1098. 

[47] Yuan, M. and Kendziorski, C.M. (2006b). Hidden Markov models for time course 
data in multiple biological conditions (with discussion). Journal of the American Sta- 
tistical Association, 101, 1323-1340. 



E-MAIL: 



newton@stat.wisc.edu 



E-MAIL: 



lchung@stat.wisc.edu 



